WGS Upscaling - IT & Bioinformatics Evaluation

Data transfer, Data storage, Bioinformatics pipeline capacity
Author
Affiliation

GDx

Published

December 5, 2023

1 Background

GDx at OUSAMG is planning to upscale the WGS production to 4 x 48 samples or 2 x 48 + 1 x 96 samples per week.

This document evaluates the possible bottlenecks of the IT & bioinformatics pipelines in the following areas:

  1. Data transfer speed
  2. Data storage
  3. Pipeline capacity (Illumina DRAGEN)

2 IT & Bioinformatics

2.1 Data transfer speed

2.1.1 Collect Historical Data Transfer Records

To evaluate the data transfer speed, we collected the transfer times of all files that were transferred from NSC to TSD between 2023-09-01 08:41:40 and 2023-11-28 13:26:06.

The nsc-exporter log and sequencer overview HTML files were ignored for simplicity.
Example records:

datetime             project    filename                                                  bytes        seconds  speed (B/s)
2023-11-07 16:43:57  wgs328     Diag-wgs328-HG20232480C8646-DR.alignedMT.bam              24224887     0.5      49170000
2023-09-09 13:53:03  wgs310     HG39186724-Bindevev-KIT-wgs_S5_R1_001.fastq.gz            35566960363  480.6    70580000
2023-10-01 20:49:12  wgs316     canvas_Diag-wgs316-HG25228024_std.vcf                     174076       0        5860000
2023-11-13 11:59:57  EKG231107  Diag-EKG231107-HG87968533.sample                          1550         0        46090
2023-09-22 01:29:10  wgs313     Diag-wgs313-HG35436997-DR-Mitokon-v4.1.0.HTS.vedlegg.pdf  31313        0        1089320
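As an illustration, records like those above could be loaded into a uniform structure for analysis. The tab-separated field layout below is an assumption for the sketch, not the actual nsc-exporter log format:

```python
from dataclasses import dataclass


@dataclass
class TransferRecord:
    datetime: str
    project: str
    filename: str
    bytes: int
    seconds: float

    @property
    def speed(self) -> float:
        # Bytes per second; files logged with 0 s fall below the timer's
        # resolution, so fall back to the byte count as a lower bound.
        return self.bytes / self.seconds if self.seconds > 0 else float(self.bytes)


def parse_record(line: str) -> TransferRecord:
    # Hypothetical layout: datetime <TAB> project <TAB> filename <TAB> bytes <TAB> seconds
    dt, project, filename, nbytes, seconds = line.rstrip("\n").split("\t")
    return TransferRecord(dt, project, filename, int(nbytes), float(seconds))


record = parse_record(
    "2023-11-07 16:43:57\twgs328\t"
    "Diag-wgs328-HG20232480C8646-DR.alignedMT.bam\t24224887\t0.5"
)
```

Note that the derived speed (bytes / seconds) can differ slightly from the speed field recorded in the log, depending on how the exporter rounds its figures.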

2.1.2 Data Overview

         filesize
Min.        0.0 B
1st Qu.   428.0 B
Median    9.3 KiB
Mean      1.5 GiB
3rd Qu. 968.0 KiB
Max.    100.9 GiB

        speed(/s)
Min.        1.0 B
1st Qu.  12.0 KiB
Median  288.1 KiB
Mean     12.2 MiB
3rd Qu.   8.4 MiB
Max.     93.1 MiB

          seconds
Min.         0.00
1st Qu.      0.00
Median       0.00
Mean        19.56
3rd Qu.      0.10
Max.      2084.40

2.1.3 Correlation Between File Size, Transfer Time, and Transfer Speed

2.1.3.1 Transfer speed and time vs. file size (all files)

2.1.3.2 Transfer speed and time vs. file size (small files)

2.1.3.3 Maximum transfer speed reached around 200 MB file size?

2.1.4 Idle Time

How much of the time is the nsc-exporter idle, i.e. not transferring files?
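One way to quantify this, sketched under the assumption that each transfer is described by a start timestamp and a duration in seconds:

```python
from datetime import datetime, timedelta


def idle_gaps(transfers):
    """Given (start_iso, seconds) tuples sorted by start time,
    return the idle gaps (as timedeltas) between consecutive transfers."""
    gaps = []
    prev_end = None
    for start_iso, seconds in transfers:
        start = datetime.fromisoformat(start_iso)
        if prev_end is not None and start > prev_end:
            gaps.append(start - prev_end)
        end = start + timedelta(seconds=seconds)
        # Overlapping transfers must not shrink the covered interval.
        prev_end = max(prev_end, end) if prev_end else end
    return gaps


gaps = idle_gaps([
    ("2023-09-09 13:00:00", 600.0),  # ends 13:10
    ("2023-09-09 13:05:00", 60.0),   # overlaps the first transfer, no gap
    ("2023-09-09 14:00:00", 30.0),   # 50 min of idle time before this one
])
```

Summing the gaps per month gives the idle fraction plotted below.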

2.1.4.1 September

(Plot: idle-period durations, logarithmic time axis.)

2.1.4.2 October

(Plot: idle-period durations, logarithmic time axis.)

2.1.4.3 November

(Plot: idle-period durations, logarithmic time axis.)

2.1.5 Discussion

  • The nsc-exporter is idle for a substantial portion of the time.
    • Notably long idle periods were observed in September.
    • Nearly 12 WGS projects were transferred in November.
  • The maximum transfer speed is reached at around 200 MB file size, which matches the configured chunk size of s3cmd, the tool used for data transfer. Increasing the chunk size might improve the transfer speed.
  • The current transfer speed is far from saturating the 10 Gbps switch connecting NSC and TSD. We need to investigate the reason for the low transfer speed.
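If we decide to experiment with a larger chunk size, s3cmd exposes it via the multipart_chunk_size_mb option in ~/.s3cfg (or the equivalent --multipart-chunk-size-mb command-line flag); the value 512 below is an example, not a recommendation:

```ini
# ~/.s3cfg (excerpt) - raise the multipart chunk size from 200 MB
multipart_chunk_size_mb = 512
```

The effect on throughput should be measured with a representative large fastq.gz file before changing the production configuration.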

2.1.6 Conclusion

  • We might be able to run 4 x 48 or 2 x 48 + 1 x 96 samples per week at the current transfer speed. However, we would be operating close to the maximum data-transfer capacity.
  • If we can increase the transfer speed, e.g. to 200 MB/s, we could easily double the current production capacity.
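A rough capacity check supports this. The per-sample data volume (~80 GB of fastq per 30x genome, in line with the example record above) and the sustained speeds are assumptions for the sketch:

```python
samples_per_week = 4 * 48                    # planned production: 192 samples
gb_per_sample = 80                           # assumed fastq volume per 30x WGS sample
total_gb = samples_per_week * gb_per_sample  # ~15.4 TB per week


def transfer_hours(speed_mb_per_s: float) -> float:
    """Hours needed to move one week's production at a sustained speed."""
    return total_gb * 1000 / speed_mb_per_s / 3600


current = transfer_hours(90)   # near the observed maximum of ~93 MiB/s
target = transfer_hours(200)   # hypothetical improved speed
```

Under these assumptions the weekly volume needs roughly two days of continuous transfer at today's peak speed, versus under one day at 200 MB/s, which is consistent with the headroom claimed above.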

2.2 Data storage

WGS produces large amounts of data, so data storage capacity is critical for the upscaling.

2.2.1 NSC

On the NSC side, the data is stored on boston at /boston/diag. Boston has a total capacity of 1.5 PB, and the usable capacity is 1.2 PB at the moment.

2.2.2 TSD

On the TSD side, the data is stored in /cluster/projects/p22. The total capacity is 1.8 PB, and the usable capacity is 1.2 PB at the moment.
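To put these numbers against the planned production, a back-of-envelope runway estimate; the ~150 GB retained per sample (fastq + bam + vcf) is an assumption, and the 1.2 PB is read as currently available space:

```python
samples_per_week = 4 * 48     # planned production: 192 samples
tb_per_sample = 0.15          # assumed ~150 GB retained per sample (fastq + bam + vcf)
available_pb = 1.2            # reported usable capacity on TSD (and boston)

weekly_tb = samples_per_week * tb_per_sample        # ~28.8 TB per week
weeks_until_full = available_pb * 1000 / weekly_tb  # ~40+ weeks
```

Even under these rough assumptions the storage fills within about a year of upscaled production, so archiving or deletion policies should be evaluated alongside the upscaling.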

2.3 Pipeline capacity (Illumina DRAGEN)

Illumina DRAGEN is a bioinformatics pipeline server that can be used to process WGS data. It takes around 1 hour to process a 30x WGS sample.
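At roughly one hour per 30x sample, a single server's weekly throughput can be checked against the planned volume (pure arithmetic on the figures above):

```python
hours_per_sample = 1.0         # approximate DRAGEN runtime for a 30x WGS sample
samples_per_week = 4 * 48      # planned production: 192 samples
hours_per_week = 7 * 24        # 168 wall-clock hours in a week

required_hours = samples_per_week * hours_per_sample  # 192 h of compute
servers_needed = required_hours / hours_per_week      # > 1 server
```

Since 192 h of compute exceeds the 168 h in a week, one DRAGEN server cannot keep up with the planned volume even at 100% utilization, before queueing, maintenance, and reruns are considered.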

3 Discussion

To be added…

4 Conclusion

To be added…